As you’ve probably learned, you can’t always rely on search engine spiders to do an effective job when they visit and index your website. Left to their own devices, bots can generate duplicate content, perceive important pages as junk, index content that shouldn’t serve as a user entry point, and cause many other issues. There are a number of tools at our disposal for making the most of bot activity on a website, such as the meta robots tag, robots.txt, the X-Robots-Tag, the canonical tag, and others.
Today, I’m covering robot control technique conflicts. In an effort to REALLY get their point across, webmasters will sometimes implement more than one robot control technique to keep the search engines away from a page. Unfortunately, these techniques can contradict each other: one technique hides the instructions of the other, or link juice is lost.
What happens when a page is disallowed in the robots.txt file and has a noindex meta tag in place? How about a noindex tag and a canonical tag?
Quick Refresher
Before we get into the conflicts, let’s go over each of the main robot access restriction techniques as a refresher.
Meta Robots Tag
The meta robots tag gives page-level instructions to search engine bots. It should be included in the head section of the HTML document and might look like this:

<head>
<title>Article Print Page</title>
<meta name="robots" content="noindex" />
</head>
Below is a table of the generally supported commands along with a description of their purpose.

Command | Description
NOINDEX | Prevents the page from being included in the index
NOFOLLOW | Prevents bots from following the links on a page
NOARCHIVE | Prevents a cached copy of the page from being available in the search results
NOSNIPPET | Prevents a description from appearing below the page link in the search results AND prevents caching of the page
NOODP | Prevents the Open Directory Project (DMOZ.org) description of the page from being displayed in the search results
NOYDIR | Prevents the Yahoo! Directory title and description for the page from being displayed in the search results
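Directives can also be combined in a single tag. For example, a sketch of a tag that keeps a page out of the index and stops bots from following its links:

```html
<meta name="robots" content="noindex, nofollow" />
```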
Canonical Tag
The canonical tag is a page-level meta tag placed in the HTML head of a webpage. It tells the search engines which URL is the canonical version of the page being displayed. Its purpose is to keep duplicate content out of the search engine index while consolidating your pages’ strength into one “canonical” page.

The code looks like this (with a placeholder URL):

<link rel="canonical" href="http://www.example.com/page.html" />
X-Robots-Tag

Since 2007, Google and other search engines have supported the X-Robots-Tag as a way to inform bots about crawling and indexing preferences through the HTTP headers used to serve a file. The X-Robots-Tag is very useful for controlling indexation of non-HTML media types such as PDF documents.
As an example, if a page is to be excluded from the search index, the directive would look like this:

X-Robots-Tag: noindex
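On an Apache server with mod_headers enabled, one way to attach this header to every PDF on the site is a configuration sketch like the following (the file pattern is an assumption; adapt it to your setup):

```apache
<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex"
</FilesMatch>
```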
Robots.txt

Robots.txt allows some control over search engine robot access to a site; however, it does not guarantee that a page won’t be crawled and indexed. It should be employed only when necessary, and no robots should be blocked from crawling an area of the site unless there are solid business and SEO reasons to do so. I almost always recommend using the meta robots ‘noindex’ tag instead for keeping pages out of the index.
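For reference, a minimal robots.txt that blocks all compliant bots from a hypothetical /newsletter/ directory looks like this:

```
User-agent: *
Disallow: /newsletter/
```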
Avoiding Conflicts
It is a bad idea to use any two of the following robot access control methods at once.
- Meta Robots ‘noindex’
- Canonical Tag (when pointing to a different URL)
- Robots.txt Disallow
- X-Robots-Tag
However strong your desire to keep a page out of the search results, one solution is always better than two. Let’s take a look at what happens when you have various combinations of robot access control techniques in place for a single URL.
Meta Robots ‘noindex’ & Canonical Tag

The canonical tag is about consolidating link value into a chosen URL, while meta robots ‘noindex’ asks the engines to drop the page from the index altogether. Those are mixed signals, and I’d expect the engines to ignore the canonical tag and skip the hoped-for reassignment of link value. If consolidation is your goal, use the canonical tag on its own.
Meta Robots ‘noindex’ & X-Robots-Tag ‘noindex’

These tags are redundant. I can’t see any way that having both in place for the same page would directly damage your SEO. But if you can alter the head of a document to implement a meta robots ‘noindex’, you shouldn’t be using the X-Robots-Tag anyway.
Robots.txt Disallow & Meta Robots ‘noindex’

This is the most common conflict I see.
The reason I love the meta robots ‘noindex’ tag is that it is effective at keeping pages out of the index, yet it can still pass value from the no-indexed page to the deeper content linked from it. That’s a win-win: no link love is lost.
If both protocols are in place, the robots.txt disallow ensures that the meta robots ‘noindex’ is never seen, because the bots never crawl the page. You’ll get the effect of a robots.txt disallow entry and miss out on all the meta robots ‘noindex’ goodness.
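You can demonstrate the crawl side of this conflict with Python’s standard library robots.txt parser; the rules and URLs below are hypothetical:

```python
from urllib import robotparser

# Hypothetical robots.txt rules that disallow the /newsletter/ area of a site
rules = [
    "User-agent: *",
    "Disallow: /newsletter/",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)

# A compliant bot never fetches this page, so a meta robots 'noindex'
# tag inside its HTML can never be read.
print(rp.can_fetch("Googlebot", "http://www.example.com/newsletter/archive.html"))  # False
```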
Below I’ll take you through a simple example of what happens when these two protocols are implemented together.
Here is a screenshot from the Google SERP for a page that is disallowed in the robots.txt and also has a meta robots ‘noindex’ in place. The fact that it is in Google’s index at all is your first clue of a problem.
Here you can see the meta robots ‘noindex’ tag on the page. Too bad the search engines can’t see it.
Here you can see that the entire subdomain is disallowed in the robots.txt, ensuring that those useful meta robots ‘noindex’ tags are never seen.
Assuming mail2web.com is sincere in its desire to keep everything out of the search engines, it would be better off using the meta robots ‘noindex’ exclusively.
Canonical Tag & X-Robots-Tag ‘noindex’

If you can alter the <head> of a document, the X-Robots-Tag likely isn’t your best route for restricting access in the first place; it works better when reserved for non-HTML file types like PDF and JPEG. If you have both of these in place, I’d imagine that the search engines would ignore the canonical tag and fail to reassign link value as hoped. If you are able to add a canonical tag to a page, you shouldn’t be using an X-Robots-Tag.
Canonical Tag & Robots.txt Disallow

If you have a robots.txt disallow in place for a page, the canonical tag will never be seen, and no link juice is passed. Do not pass go. Do not collect $200. Sorry.
X-Robots-Tag ‘noindex’ & Robots.txt Disallow

Because the X-Robots-Tag lives in the HTTP response header, it is possible that these two implementations could intermingle and both be seen by the search engines. Even then, the statements would be redundant, and the robots.txt entry would ensure that no links within the page are discovered. Once again, we have a bad idea on our hands.
———————————
Bonus Points!

I searched high and low for a live example to share here. I wanted to find a PDF that was both robots.txt disallowed AND noindexed with the X-Robots-Tag. Sadly, I came up empty-handed. I’d have dug around all night, but this post had to go live at some point! Please, I beg you, beat me at my own game.
My process was as follows:

3. Call up the robots.txt-disallowed PDF file and check the response header for an X-Robots-Tag noindex entry.
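For that last check, here is a small Python sketch; the helper name and the sample headers are mine, purely for illustration:

```python
def is_noindexed(headers):
    """Return True if an X-Robots-Tag header carries a 'noindex' directive."""
    value = headers.get("X-Robots-Tag", "")
    return "noindex" in [directive.strip().lower() for directive in value.split(",")]

# In practice you would issue a HEAD request for the PDF and pass its
# response headers in; here, illustrative headers for an excluded PDF:
sample = {"Content-Type": "application/pdf", "X-Robots-Tag": "noindex, nofollow"}
print(is_noindexed(sample))  # True
```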
Good luck! Let me know when you find one!
———————————-
The concept I’ve been driving at here is fairly straightforward: don’t go overboard with your robot control techniques. Choose the best method for the scenario and back away from the machine. You’ll be much better off.
Happy Optimizing!